We find a novel correlation structure in the residual noise of stock market returns that is remarkably linked
to the composition and stability of the top few significant factors driving the returns, and moreover indicates
that the noise band is composed of multiple subbands that do not fully mix. Our findings allow us to construct
effective generalized random matrix theory market models [3, 4] that are closely related to correlation and
eigenvector clustering [6, 12]. We show how to use these models in a simulation that incorporates heavy tails.
Finally, we demonstrate how a subtle, purely stationary risk-estimation bias can arise in the conventional cleaning
prescription [?].



Introduction: Originally developed in the context of nuclear physics [1], random matrix theory (RMT)
has since found numerous applications in fields such as number theory, disordered systems, neural
networks, and signal processing [?]. Recently, the pioneering work of Laloux et al. [?], as well as
much subsequent research [?], has shown that RMT can also be a valuable tool for analyzing stock
market correlations, where noise can account for more than 2/3 of the eigenvalue spectrum and a
typical large portfolio has a size comparable to the measurement time frame. Thus, many of the
empirical eigenvalues are spurious and represent measurement noise and biases. The remarkable
insight of Laloux et al. was to show that a suitable fit to RMT can clean these spurious
contributions and, moreover, identify the statistically significant signal, that is, the common
market risk factors that drive the individual stock returns. The most prominent such
non-idiosyncratic factor is the nearly equal-weight top eigenvector, whose eigenvalue is more than
20 times larger than the average of the spectrum. Secondary factors are long-short portfolios with
a certain liquidity [3] and industry structure [?], but their contribution is typically an order of
magnitude smaller. Most of the remaining eigenvectors are unstable in time and appear random, and
their spectral contribution can be fitted to the Marchenko-Pastur (MP) distribution [8] derived in
the context of Gaussian RMT (GRMT). The noisy eigenvalue correlations [3] also agree with theory
[1]. These results have been verified over many stock selections, as well as return frequencies [?].

Despite the apparent success of the theory, subsequent research suggests several empirical aspects
that the original RMT cleaning may not account for properly. (1) Tails and their correlations have
non-trivial effects: they are known both to broaden the spectrum above the upper noise-band edge and
to sharpen it near the lower edge [?, 9, 10], thus making the fit to the MP distribution
problematic. This redistribution of spectral weight appears in conjunction with an enhancement of
the inverse participation ratios around both ends of the noise spectrum, the so-called localization
effect [?], unlike in GRMT, where the participations are flat [1]. (2) In addition to being
partially localized, the band itself may be split due to the same separation of correlation scales
[11] that is thought to give rise to the clustering of stocks into industries [12]. So far this
effect has not been observed, however, because of the large amount of mixing that depletes the
stability of all the noisy eigenvectors. It is important to distinguish empirically between the
single-band and multiple-band cases. (3) Non-stationarity effects are insufficiently understood.
They have been suggested [?, 4] to be the source of a residual bias in the risk estimates obtained
after RMT cleaning. However, in light of the above considerations, it is not clear that the
original cleaning procedures are unbiased to begin with.

In this work, we consider both 2-minute S&P 500 TAQ midquote returns for N = 484 stocks between
June 20 and Sep 20, 2007, and daily S&P 500 returns for N = 451 stocks between Jan 2001 and
Dec 2007 [13]. (1) We reveal a novel correlation structure of the residuals that is linked to the
structure and stability of the top few empirical factors. Specifically, we find that the inverse
participations of the localized edge eigenmodes of the band are dominated by the outlier stocks in
the composition of the top few factors, indicating that most of the noise there is due to these
stocks. The upper-edge fluctuations are mainly due to weakly correlated stocks with the smallest
relative weight in the market portfolio, while the lower-edge fluctuations are due to strongly
correlated stocks identified as the outliers in the secondary factors. The groups at the lower edge
belong to major industrial sectors [4, 12], while the upper edge contains a large diversified
portfolio of medium- to small-liquidity stocks. Moreover, because we find these groups to be
disjoint, we conclude that as long as the top few factors are stable and distinct, the noise band
is composed of multiple subbands that do not fully mix. (2) We identify the effective
positive-definite cleaned matrices that exhibit the above multi-residual and factor structure as
hierarchical RMT models, which are closely related to coarse-grained "real space" models of market
clustering [12] and fundamentally arise out of correlation scale separation. (3) We use these
effective models to perform a one-factor stochastic volatility [9] simulation in order to take into
account the effect of tails and their correlations. (4) We show how conventional cleaning can give
rise to a subtle, purely stationary risk-estimation bias.

Empirical Results: Given our set of N stock series $S_i(t)$, $t = 1, \ldots, T$, from their log
returns $x_{t,i} = \log(S_{t,i}/S_{t-1,i})$ we calculate the empirical correlation
$C_E = (\langle x_i x_j \rangle - \langle x_i \rangle \langle x_j \rangle)/(\sigma_i \sigma_j)$, where
$\sigma_i = \sqrt{\langle x_i^2 \rangle - \langle x_i \rangle^2}$. According to RMT [?], if $C_E$
were obtained from a purely random signal of bounded variance $\sigma^2$ whose marginal tails are
not too heavy [?], then in the limit $N \to \infty$ with $Q = T/N$ fixed, the correlations will
self-average and will have an asymptotically deterministic eigenvalue spectrum given by the MP
distribution [8]:

$$\rho_{MP}(\lambda) = \frac{Q}{2\pi\sigma^2}\,\frac{\sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)}}{\lambda} \qquad (1)$$

In the above, the eigenvalues $\lambda$ are restricted to lie within the hard-edge spectral band,
$\lambda \in [\lambda_-, \lambda_+]$, with
$\lambda_\pm = \lambda_\pm(Q, \sigma) = \sigma^2 (1 \pm \sqrt{1/Q})^2$. One can interpret (1) as the
finite-T/N noise-induced broadening and bias away from the underlying trivial spectrum
$\rho_{clean}(\lambda) = \delta(\lambda - 1)$ that $\rho_{MP}(\lambda)$ reduces to in the limit
$Q \to \infty$.

Figure 1: Main: A fit of the MP distribution to 2 min data for N = 484 stocks in the S&P 500 with
T = 3N yields $Q_{eff} = 2.25$ and $\sigma_{eff} = 0.67$. The top five eigenvalues,
$\lambda_1 = 152.9$, $\lambda_2 = 8.2$, $\lambda_3 = 7.6$, $\lambda_4 = 5.3$, $\lambda_5 = 5.2$,
were omitted from the plot due to their scale. Inset: The top three eigenvectors, $e_{ki}$,
$k = 1, 2, 3$, with their entries $i$ sorted by decreasing liquidity (from left to right). Note the
significant outliers in each, as emphasized by the horizontal lines.
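
As a minimal numerical sketch of the MP band discussed above (not the authors' code), the following Python snippet evaluates the band edges and the density in the convention $Q = T/N \ge 1$. The function names `mp_edges` and `mp_density` are illustrative choices, and the trapezoidal integral is only a sanity check that the density is normalized.

```python
import numpy as np

def mp_edges(Q, sigma=1.0):
    """Band edges sigma^2 * (1 -/+ sqrt(1/Q))^2, using the convention Q = T/N >= 1."""
    root = np.sqrt(1.0 / Q)
    return sigma**2 * (1.0 - root) ** 2, sigma**2 * (1.0 + root) ** 2

def mp_density(lam, Q, sigma=1.0):
    """Marchenko-Pastur density on an array of eigenvalues; zero outside the band."""
    lam = np.atleast_1d(np.asarray(lam, dtype=float))
    lam_minus, lam_plus = mp_edges(Q, sigma)
    rho = np.zeros_like(lam)
    inside = (lam > lam_minus) & (lam < lam_plus)
    x = lam[inside]
    rho[inside] = Q / (2.0 * np.pi * sigma**2) * np.sqrt((lam_plus - x) * (x - lam_minus)) / x
    return rho

# Fitted values quoted in the text for the 2 min data.
lam_minus, lam_plus = mp_edges(Q=2.25, sigma=0.67)
grid = np.linspace(lam_minus, lam_plus, 20001)
dens = mp_density(grid, Q=2.25, sigma=0.67)

# Trapezoidal rule: the density should integrate to about 1 over the band.
mass = float(np.sum(0.5 * (dens[1:] + dens[:-1]) * np.diff(grid)))
print(lam_minus, lam_plus, mass)
```

In practice one would compare `mp_density` against a histogram of the empirical eigenvalues of $C_E$ and adjust $Q$ and $\sigma$ until the noisy part of the spectrum is matched.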

Of course, stock market correlations are not purely random, so a fit for $\sigma$ and $Q$ is
necessary in (1) if one wants to identify the truly residual part of the spectrum [?]. In Fig. 1
we show such a fit to the noisy region of the 2 min data that yields $\sigma_{eff} = 0.67$,
$Q_{eff} = 2.25$. Note that much of the spectrum lies outside the MP band. In the inset of Fig. 1
we plot the composition of the top three eigenvectors $\{e_{ki}\}$, $k = 1, 2, 3$, sorted by
decreasing liquidity. Unlike the prediction of GRMT, where the (normalized) eigenvector components
should be distributed as a mean-zero unit Gaussian, there are clear deviations from such behavior
in all three eigenvectors, as emphasized in the inset. In fact, $e_1$ has non-zero mean,
$\langle e_{1i} \rangle_i = 0.044 \approx N^{-1/2}$, representing a long-only market portfolio,
while $e_2$ and $e_3$ represent long-short portfolios. All three $e_k$ can be interpreted as
significant common factors. Furthermore, as is clear from the plot, the outliers in the factor
composition have a certain liquidity structure. In the case of $e_2$ and $e_3$ in the inset of
Fig. 1, these outliers can be identified with major sectors such as financials, oil, and
utilities, whose correlations are relatively stable in time [?].

Despite the appearance of factors, one expects the random residual spectral contribution to be
well fitted by RMT. However, there are important issues with the fit to $\rho_{MP}$ that one needs
to address. Heavy tails in the multivariate distribution of $x_{t,i}$ have non-trivial effects.
They are known to broaden the spectral weight above the upper edge, as well as to sharpen it near
the lower edge [?], both features being readily noticeable in Fig. 1 as well as in daily data [?].
Moreover, such tails tend to induce outliers in the composition of the principal components near
the band edge [?], causing deviations from the standard Gaussian distribution of the composition
expected by GRMT and inducing localization. Indeed, just as in daily data [?], we see in the 2 min
returns that the eigenvectors are localized at both ends of the noise spectrum by computing the
inverse participation ratio [3], $I_k = \sum_{i=1}^{N} [e_{ki}]^4$, for each eigenvector $e_k$.
Intuitively, the participation $P_k = 1/I_k$ scales as the number of non-trivial entries in a
normalized $e_k$: $P_k = N$ for equal-weight vectors, while $P_k = 1$ for a single non-trivial
weight. As is evident in Fig. 2 (a), the participation is strongly localized near the band edges,
indicating that the eigenvectors there are dominated by outliers.
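
The two limiting cases of the participation quoted above can be checked directly. The sketch below is illustrative only: the helper name `inverse_participation` and the purely Gaussian test data are assumptions, not part of the original analysis.

```python
import numpy as np

def inverse_participation(eigvecs):
    """I_k = sum_i e_ki^4 for each normalized eigenvector stored as a column."""
    return np.sum(np.asarray(eigvecs) ** 4, axis=0)

N = 100

# Limiting cases: an equal-weight vector and a vector with a single non-zero weight.
equal_weight = np.full((N, 1), 1.0 / np.sqrt(N))
single_stock = np.zeros((N, 1))
single_stock[0, 0] = 1.0
P_equal = 1.0 / inverse_participation(equal_weight)[0]    # close to N
P_single = 1.0 / inverse_participation(single_stock)[0]   # equal to 1

# Eigenvectors of a purely random (Gaussian) correlation matrix, for comparison.
rng = np.random.default_rng(0)
x = rng.standard_normal((3 * N, N))          # T = 3N synthetic "returns"
C = np.corrcoef(x, rowvar=False)
_, eigvecs = np.linalg.eigh(C)               # columns are eigenvectors
P_random = 1.0 / inverse_participation(eigvecs)
print(P_equal, P_single, P_random.mean())
```

For the Gaussian case the participations come out roughly flat across the spectrum, at a level of order N/3, which is the reference against which localization near the band edges is judged.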

A nice trick that avoids estimating the effects of tails is to clean the noisy eigenvalues
$\Lambda_E = \mathrm{diag}(\lambda_1, \ldots, \lambda_K; \{\lambda_{noise}\})$ of
$C_E = S_E \Lambda_E S_E'$ with a flat band with scale proportional to $\sigma_{eff}$ while
working in the original eigenvector basis $S_E = (e_1, \ldots, e_N)$ [?], thus obtaining a
"filtered" matrix [4]. In fact, unless the empirically measured tail fluctuations significantly
break the rotational invariance implied by cleaning with a flat band, $\sigma_{eff}$ can be thought
of as the overall scale of the residual noise, which one can tune even without fitting to the
Gaussian formula. The feasibility of this "filtering" procedure can also be justified by the
resulting significant improvement of the portfolio risk estimates that one obtains with cleaning
[?].
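
A minimal sketch of such a flat-band filter is given below. It assumes one particular choice of the flat level, namely the average of the discarded noisy eigenvalues, which preserves the trace; the text only states that the scale is proportional to $\sigma_{eff}$, so this choice, the function name `clean_flat_band`, and the synthetic one-factor data are all assumptions made for illustration.

```python
import numpy as np

def clean_flat_band(C, n_factors):
    """Keep the top n_factors eigenvalues and flatten the rest, in the original eigenbasis.

    Assumption: the flat level is the mean of the noisy eigenvalues (trace-preserving).
    """
    eigvals, eigvecs = np.linalg.eigh(C)          # eigenvalues in ascending order
    n_noise = len(eigvals) - n_factors
    cleaned = eigvals.copy()
    cleaned[:n_noise] = eigvals[:n_noise].mean()  # flat band for the noisy part
    return (eigvecs * cleaned) @ eigvecs.T        # S diag(cleaned) S'

# Synthetic one-factor data, used only to exercise the function.
rng = np.random.default_rng(1)
T, N = 300, 100
market = rng.standard_normal((T, 1))
x = 0.5 * market + rng.standard_normal((T, N))
C_E = np.corrcoef(x, rowvar=False)
C_filtered = clean_flat_band(C_E, n_factors=1)
print(np.trace(C_E), np.trace(C_filtered))
```

The diagonal of the filtered matrix is in general no longer exactly one; whether and how to rescale it is a separate choice that this sketch does not address.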

We will now show, however, that because of a novel structure of the eigenvectors, cleaning with a
flat band $\sigma_{eff}$ is inconsistent with their symmetry. Suppose we look at the following
groups of stocks, $\{G_1, G_k^{\pm}, G_x\}$, $k = 2, 3$, selected so that $G_k^{\pm}$ contains the
top/bottom outliers $|e_{ki}| > \epsilon_k$ of $e_k$ above/below a certain threshold $\epsilon_k$
(see Fig. 1 inset), $G_1$ are the outliers of smallest absolute weight in $e_1$, and $G_x$ are all
the other stocks. We find that for reasonable threshold values,
$\epsilon_k \approx 1.5\,\sigma_{e_k}$, $G_1$ is disjoint from $G_{23} = \bigcup G_k^{\pm}$.
Moreover, the relative contribution of each group $G$ to the inverse participation,

$$R_k^{(G)} = \frac{1}{I_k} \sum_{i \in G} [e_{ki}]^4 \qquad (2)$$

is inhomogeneously distributed across the noise band as shown in Fig. 2 (b), so that $G_1$
contributes mostly to the upper edge while $G_{23}$ contributes mostly to the lower edge. Because
this behavior is inconsistent with homogeneous cleaning, we interpret it as an indication that the
noise band is composed of multiple subbands that do not mix. In particular, the three groups above
form a partition of all the stocks, where the number of assets in each subgroup, or the group
degeneracies, for the data in Fig. 1 are $\{D_1, D_{23}, D_x\} = \{46, 61, 377\}$.

Figure 2: Top: The participation of the eigenvectors $e_k$, $k > 2$, for the data in Fig. 1
exhibits localization. The flat horizontal line represents $P^{GRMT} = N/3$ [?]. Bottom: Relative
inverse participation $R_k^{(G)}$ for the groups $G_1$ (red), $G_{23}$ (green) and $G_x$ (blue)
defined in the text. Flat lines represent $R_k$

Figure 3: The same quantities as in Fig. 2, except that all the data were simulated from the
effective model (3) by taking into account tail effects in the multivariate distribution via the
one-factor stochastic volatility model [9] with tail index $\nu = 3$.
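
The group construction and the relative inverse participation described above can be sketched as follows. This is an illustration on synthetic Gaussian data, not a reproduction of the empirical result: the helper names, the size `n_weak` of the weak-weight group, and the explicit set difference that forces the groups to be disjoint are all assumptions (in the text, disjointness is an empirical finding rather than something imposed).

```python
import numpy as np

def outliers(e, n_sigma=1.5):
    """Indices whose |entry| exceeds n_sigma standard deviations of the entries of e."""
    e = np.asarray(e)
    return np.flatnonzero(np.abs(e) > n_sigma * np.std(e))

def relative_ipr(eigvecs, groups):
    """R_k^(G) = sum_{i in G} e_ki^4 / sum_i e_ki^4 for every eigenvector column k."""
    e4 = np.asarray(eigvecs) ** 4
    total = e4.sum(axis=0)
    return {name: e4[idx].sum(axis=0) / total for name, idx in groups.items()}

# Synthetic data, used only to exercise the functions.
rng = np.random.default_rng(2)
T, N = 360, 120
x = rng.standard_normal((T, N))
_, V = np.linalg.eigh(np.corrcoef(x, rowvar=False))
e1, e2, e3 = V[:, -1], V[:, -2], V[:, -3]        # top three eigenvectors

G23 = np.union1d(outliers(e2), outliers(e3))     # outliers of the secondary factors
n_weak = 12                                      # hypothetical size of the weak-weight group
weakest = np.argsort(np.abs(e1))[:n_weak]        # smallest absolute weights in e1
G1 = np.setdiff1d(weakest, G23)                  # disjointness is forced here for the demo
Gx = np.setdiff1d(np.arange(N), np.union1d(G1, G23))

R = relative_ipr(V, {"G1": G1, "G23": G23, "Gx": Gx})
print(len(G1), len(G23), len(Gx))
```

Because the three groups partition the stocks, the three values of $R_k^{(G)}$ sum to one for every eigenvector; plotting them against the eigenvalue index is what reveals whether a group dominates one edge of the band.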

Constructing Coarse-Grained Effective Models: The partition above is reminiscent of partitions
previously obtained by "real space" hierarchical clustering of stocks into industries [12], which
can be thought of as arising from a coarse-grained separation of correlation scales, also known to
give rise to multiple subbands in the spectrum [11]. Moreover, it has been observed [6] that
clustering of significant eigenvector components results in the same industries as those obtained
from the real space procedure, and arises out of a mean-field duality relation between the two
approaches [7]. Therefore, we interpret the multiple subband structure above as arising from a
particular type of underlying separation of correlation scales apparent at the time scales of
measurement of $C_E$. We have observed the above multiple band structure in both 2 min and daily
data.

Let us gain insight into the details of this correlation structure for the case of the data in
Fig. 1. By clustering analysis [12] we find that $G_{23}$ separates further,
$G_{23} = \{G_{23}^{+}, G_{23}^{-}\}$, into two nearly equal-sized groups of distinctly
higher/lower average correlation with degeneracies $\{D_{23}^{+}, D_{23}^{-}\} = \{29, 32\}$,
respectively. For this sample, we find that $G_{23}^{+}$ contains Electric Utilities, as well as
Oil & Gas Drilling & Exploration stocks, while $G_{23}^{-}$ contains Oil & Gas Exploration, as well
as some Financial stocks. From the average correlation between all four groups, we thus construct
the following "minimal" coarse-grained effective model:

(3)

which can be readily checked to be positive definite. The entries of each $D \times D$ diagonal
block above are $C_{D \times D} = (1 - \rho_G) I_D + \rho_G^{D \times D}$, with
$\{\rho_{23}^{+}, \rho_{23}^{-}, \rho_x, \rho_1\} = \{0.59, 0.48, 0.32, 0.13\}$ respectively being
the average correlation of each of the four groups $G$, and $\rho_G^{D \times D}$ is a block whose
entries are $\rho_G$. One can check that (3) also gives rise to 4 distinct factors and 4 distinct
subbands.
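
The subband structure of such a block model can be verified numerically. The sketch below uses the group sizes and within-group correlations quoted above, but the between-group correlation `rho_between = 0.10` is a hypothetical placeholder, since the off-diagonal blocks of (3) are not reproduced here; the function name `block_correlation` is likewise illustrative.

```python
import numpy as np

def block_correlation(sizes, rho_within, rho_between):
    """Block matrix with unit diagonal, rho_within[g] inside group g, rho_between across groups."""
    n_groups = len(sizes)
    R = np.full((n_groups, n_groups), float(rho_between))
    np.fill_diagonal(R, rho_within)
    labels = np.repeat(np.arange(n_groups), sizes)   # group label of every stock
    C = R[np.ix_(labels, labels)]                    # expand group-level matrix to stock level
    np.fill_diagonal(C, 1.0)
    return C

sizes = [29, 32, 377, 46]                  # D_23^+, D_23^-, D_x, D_1 from the text
rho_within = [0.59, 0.48, 0.32, 0.13]      # average within-group correlations from the text
rho_between = 0.10                         # hypothetical: not taken from the paper

C_eff = block_correlation(sizes, rho_within, rho_between)
eigvals = np.linalg.eigvalsh(C_eff)

# Each group contributes a degenerate level 1 - rho_G; the remaining eigenvalues are factors.
levels = 1.0 - np.asarray(rho_within)
multiplicity = [int(np.sum(np.isclose(eigvals, lv))) for lv in levels]
n_factors = int(np.sum(eigvals > 1.0))
print(eigvals.min() > 0, multiplicity, n_factors)
```

The degenerate levels arise because any vector supported on a single group with entries summing to zero is annihilated by every constant block, leaving only the $(1 - \rho_G)$ term; this holds for any choice of between-group correlations, while positive definiteness does depend on that choice.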

Note that unlike the strongly correlated stocks in $G_{23}$, the stocks in $G_1$ are typically not
easily detectable with conventional hierarchical clustering approaches [12], although they are
distinctly visible if one looks at the top factor (see Fig. 1 inset). Indeed, being weakly
correlated with each other as well as with the rest of the market, these stocks will not appear in
localized real-space clusters but will instead group with other stocks in later stages of the
hierarchy. At the same time, both the degeneracy $D_1$ and the overall risk contribution of $G_1$
are comparable to those of the localized sectors, as also directly suggested by Fig. 2 (b). To
properly account for the separation of correlation scales in markets, one must also include the
contribution of the weakly correlated stocks.

Simulating with tails: A check of the validity of the effective model (3) is ultimately provided
if one can reproduce the empirical spectrum and participations through simulation. To do so, one
must properly take into account the heavy-tailed behavior of actual returns. It is known that such
tails can be induced by heteroskedasticity of the underlying stock volatilities [?], although the
details of the correlations of such volatility dynamics are not well understood. We thus use the
simplest multivariate conditional Gaussian model with one-factor variance-gamma volatilities,
which is known to produce a Student-t type of series [9] for the joint returns. We use an
inverse-gamma tail index of $\nu \approx 3$. The resulting spectrum agrees well with the empirical
one. Moreover, comparing Figs. 2 and 3, we see that the inverse participations are also in good
agreement.
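
One simple way to realize a conditionally Gaussian model with a single common volatility factor is sketched below. It assumes that the common variance is inverse-gamma distributed with shape and scale $\nu/2$, which yields multivariate Student-t returns with $\nu$ degrees of freedom; this is one reading of the model described above, not necessarily the exact specification used for Fig. 3. The function name and the small demo correlation matrix are illustrative.

```python
import numpy as np

def simulate_heavy_tailed_returns(C, T, nu=3.0, seed=0):
    """Conditionally Gaussian returns with correlation C and one common random variance.

    Assumption: the variance factor is inverse-gamma(nu/2, nu/2), giving Student-t tails.
    """
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(C)
    gaussian = rng.standard_normal((T, C.shape[0])) @ L.T      # rows ~ N(0, C)
    # Inverse-gamma draw: (nu/2) divided by a Gamma(nu/2, 1) variate, one per time step.
    variance = (nu / 2.0) / rng.gamma(shape=nu / 2.0, scale=1.0, size=T)
    return np.sqrt(variance)[:, None] * gaussian

# Small demo matrix; any positive-definite correlation matrix can be used instead.
N, T = 50, 2000
C_demo = np.full((N, N), 0.3)
np.fill_diagonal(C_demo, 1.0)
x = simulate_heavy_tailed_returns(C_demo, T, nu=3.0)

# Pooled kurtosis; a Gaussian sample would give a value close to 3.
pooled_kurtosis = float(np.mean(x**4) / np.mean(x**2) ** 2)
print(x.shape, pooled_kurtosis)
```

Feeding the simulated series into the same correlation, eigenvalue, and participation calculations used for the data is what allows a like-for-like comparison of the spectra and participations.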

A Subtle Stationary Bias: The discussion so far suggests that even for stationary data, RMT
cleaning could produce biased risk estimates. Let us demonstrate this for the simplest case of
multivariate Gaussian returns simulated with the effective model (3). Without loss of generality
we normalize the returns to zero mean and unit variance. Using the notation in [?], the predicted
risk of a portfolio $w = (w_1, \ldots, w_N)$



is m = 



The portfolios we look at are equal-weight average representatives of different subbands $K$,
$w_K = \sum_{k \in K} e_k$, where $\{e_k\}$ are the eigenvectors of the effective model (3).
Moreover, instead of a "budget constraint" [4], we impose a "risk constraint" by normalizing $w_K$
to unit norm. We then compute at every forecasting period the relative difference between realized
and predicted risk, $\delta_r = (\Omega_R^2 - \Omega_P^2)/\Omega_P^2$. For the subbands
$\{K_1, K_x, K_{23}\}$ corresponding to the groups of stocks that enter in Fig. 2 (b), we find
respective biases $\delta_r = \{26 \pm 4\%, 2 \pm 4\%, -17 \pm 2\%\}$. Note that although
$\delta_{r1}$ and $\delta_{r23}$ are significant, they are of opposite sign. Indeed, we have
checked that all three contributions nearly cancel when one looks at the relative realized versus
predicted risk of the entire noise band, $\delta_{r,all} = 2 \pm 3\%$. Finally, we also observe
significant biases $\delta_{r1}$ and $\delta_{r23}$ in the actual data. However, in this case,
there are subtleties in disentangling the effects of multiple bands, tails, and non-stationarity.
We postpone discussing these effects, as well as multi-residual generalizations of the RMT
cleaning procedure, to later work [7].
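
A toy version of this kind of experiment can be sketched as follows. It is not the authors' procedure: the three-group block model, the between-group correlation, the trace-preserving flat-band filter, and the use of an out-of-sample empirical correlation matrix as the "realized" risk are all assumptions chosen to keep the example small, and every function name is illustrative.

```python
import numpy as np

def block_correlation(sizes, rho_within, rho_between):
    """Block correlation matrix with constant within- and between-group correlations."""
    n_groups = len(sizes)
    R = np.full((n_groups, n_groups), float(rho_between))
    np.fill_diagonal(R, rho_within)
    labels = np.repeat(np.arange(n_groups), sizes)
    C = R[np.ix_(labels, labels)]
    np.fill_diagonal(C, 1.0)
    return C

def clean_flat_band(C, n_factors):
    """Keep the top n_factors eigenvalues; replace the rest by their mean (assumed level)."""
    eigvals, eigvecs = np.linalg.eigh(C)
    n_noise = len(eigvals) - n_factors
    cleaned = eigvals.copy()
    cleaned[:n_noise] = eigvals[:n_noise].mean()
    return (eigvecs * cleaned) @ eigvecs.T

def subband_bias(C_true, level, n_factors, T, n_trials=10, seed=0):
    """Mean and spread of (realized - predicted) / predicted risk for one subband portfolio."""
    rng = np.random.default_rng(seed)
    N = C_true.shape[0]
    eigvals, V = np.linalg.eigh(C_true)
    w = V[:, np.isclose(eigvals, level)].sum(axis=1)   # equal-weight sum over the subband
    w /= np.linalg.norm(w)                             # "risk constraint": unit norm
    L = np.linalg.cholesky(C_true)
    deltas = []
    for _ in range(n_trials):
        x_in = rng.standard_normal((T, N)) @ L.T       # in-sample Gaussian returns
        x_out = rng.standard_normal((T, N)) @ L.T      # independent out-of-sample returns
        predicted = w @ clean_flat_band(np.corrcoef(x_in, rowvar=False), n_factors) @ w
        realized = w @ np.corrcoef(x_out, rowvar=False) @ w
        deltas.append((realized - predicted) / predicted)
    return float(np.mean(deltas)), float(np.std(deltas))

# Hypothetical three-group toy model (not the four-group model of the text).
C_toy = block_correlation([10, 10, 40], [0.6, 0.5, 0.2], 0.1)
bias_low, spread_low = subband_bias(C_toy, level=1 - 0.6, n_factors=3, T=180)
bias_high, spread_high = subband_bias(C_toy, level=1 - 0.2, n_factors=3, T=180)
print(bias_low, bias_high)
```

Since a flat band assigns the same level to every noisy direction, the predicted risk of the subband with the lowest true level is expected to overshoot the realized one, while a subband with a higher true level tends to be affected in the opposite direction; the size of the effect in this toy setting should not be compared with the percentages quoted above.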

Summary: In conclusion, we have found strong evidence that, rather than being homogeneous, the
stock market correlation residuals are composed of multiple subbands that do not fully mix. This
structure is manifested through an asymmetry in the relative inverse participations of the
eigenvectors within the noise band, which is inconsistent with purely symmetric cleaning that does
not distinguish between different parts of the noise spectrum. The multi-residual picture above
naturally emerges from market models with multiple correlation scales, which we have identified
and simulated. As a direct consequence, the scale separation within the noise band also produces
inhomogeneities in the effective residual risk that in turn induce purely stationary biases of the
original RMT cleaning.

We would like to thank Marco Avellaneda and Jim Gatheral 
for their insightful comments and discussion. 